NetApp: wait for volume to become RW after snapmirror break#299
Carthaca wants to merge 2 commits into stable/2023.1-m3 from
Conversation
Just a thought: the snapmirror relationship status could also be checked to be "broken-off".
Another thought: we could mount the volume while it's still in DP mode once promote is selected, and then perform the SnapMirror break.
Is that allowed? As far as I remember, the mount was not allowed for a DP-type volume; the junction path could only be applied once the volume was no longer DP. Maybe something changed in newer releases.
Yes, adding junction-path to the DP volume is possible both before and after the initial transfer. However, data access through the junction path is only permitted once the baseline copy has completed. |
Yes, that was the confusion; I barely remember, but during the initial transfer it used to give a warning. @Carthaca, since we offer RO replicas, the mount operation must have been kicked off earlier. Is it only for the case where the user creates the access rule for the replica/destination?
During replica promotion, replicas failed with mount errors after snapmirror break operations. NetApp audit logs showed that the break commands were issued but remained in "Pending" state while Manila immediately attempted to mount the volumes. The mounts failed because the volumes were still DP type: the break operations hadn't completed yet.

The break_snapmirror method previously assumed break_snapmirror_vol() was synchronous and that the volume would immediately be RW. In practice, the break operation can take several seconds to complete.

This adds a polling loop after break_snapmirror_vol() that waits for the volume type to transition from 'dp' to 'rw' before attempting the mount. The implementation mirrors the existing wait_for_quiesced logic, using the netapp_snapmirror_quiesce_timeout config with 5-second intervals. If the volume doesn't become RW within the timeout, a NetAppException is raised with details about the timeout and volume name.

Additionally, this optimizes promotion for readable replicas by skipping the mount operation entirely. Readable replicas are already mounted when created (with junction path), since they need to be accessible for read operations. Only DR replicas need mounting after snapmirror break.

Change-Id: I2b8f9a1c5d7e3a4f6b9c8d1e2f3a4b5c6d7e8f9a
Signed-off-by: Maurice Escher <maurice.escher@sap.com>
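The polling loop described in the commit message can be sketched as follows. This is a minimal illustration, not the actual driver code: the client method name (get_volume_type) and the helper name are hypothetical stand-ins for whatever the NetApp client exposes; the real patch mirrors the existing wait_for_quiesced logic and reads the timeout from the netapp_snapmirror_quiesce_timeout config option.

```python
import time


class NetAppException(Exception):
    """Stand-in for the driver's NetApp exception type."""


def wait_for_volume_rw(client, volume_name, timeout=3600, interval=5):
    """Poll until the broken-off volume reports type 'rw'.

    `client.get_volume_type` is a hypothetical call returning
    'dp' or 'rw' for the named volume.
    """
    retries = int(timeout / interval) or 1
    while retries > 0:
        if client.get_volume_type(volume_name) == 'rw':
            return
        retries -= 1
        if retries > 0:
            time.sleep(interval)
    # Mirrors the wait_for_quiesced failure mode: raise with the
    # timeout and volume name so the operator can diagnose it.
    raise NetAppException(
        "Volume %(vol)s did not become RW within %(timeout)s seconds "
        "after snapmirror break." % {'vol': volume_name,
                                     'timeout': timeout})
```

With a loop like this in place, the subsequent mount only runs once ONTAP has actually finished the break, instead of racing a still-pending operation.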
6b5f68f to c314414
@sumitarora2786 yes, the additional mount is not needed in the readable-replica case; I made that optimization in the code. The both-sides-DP problem was caused by the periodic update trying to "repair" the original snapmirror (
…tion

During replica promotion, the API layer sets the replica status to STATUS_REPLICATION_CHANGE, and both promote_share_replica and _share_replica_update use the @locked_share_replica_operation decorator to prevent concurrent execution via a shared lock. However, after a promotion failure, a sequential race can occur:

1. API sets replica status to STATUS_REPLICATION_CHANGE
2. Promote operation acquires lock and starts
3. Promote fails (e.g., mount error after snapmirror break)
4. Exception handler sets both replicas to ERROR status
5. Promote releases lock and exits
6. periodic_share_replica_update acquires lock shortly after
7. Sees both replicas in ERROR, but no snapmirror relationship
8. Attempts to recreate snapmirror based on stale database state
9. Creates relationship in wrong direction (old source -> old dest)

This adds two safeguards in _share_replica_update:

1. Explicitly skip replicas with STATUS_REPLICATION_CHANGE status. While the lock prevents concurrent execution during promotion, this provides defense-in-depth and makes the intent explicit. STATUS_REPLICATION_CHANGE is intentionally NOT added to TRANSITIONAL_STATUSES, as it has special handling in the Share model's instance property for replica selection ordering.

2. If both the active replica and the target replica are in ERROR status, skip the driver update entirely. This prevents automatic "recovery" after failed critical operations that require manual intervention. Without this, periodic updates recreate snapmirror relationships in incorrect directions after failed promotions.

The checks are placed in the share manager (not the driver), as they are policy decisions about when to skip automatic operations.

Change-Id: I3c7d9b2e8f4a5b6c7d8e9f0a1b2c3d4e5f6a7b8c
Signed-off-by: Maurice Escher <maurice.escher@sap.com>
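The two safeguards in that commit message reduce to a simple skip predicate. The sketch below is illustrative only: should_skip_replica_update is a hypothetical helper (the real change lives inline in Manila's _share_replica_update), and the status string literals are assumptions mirroring manila.common.constants.

```python
# Assumed values, mirroring manila.common.constants.
STATUS_ERROR = 'error'
STATUS_REPLICATION_CHANGE = 'replication_change'


def should_skip_replica_update(active_replica, replica):
    """Return True when the periodic update must leave this pair alone.

    Both arguments are dicts with at least a 'status' key, standing in
    for Manila share-replica DB records.
    """
    # Safeguard 1: a promotion is in flight (or its status was never
    # cleared). Skipping here is defense-in-depth on top of the
    # @locked_share_replica_operation lock.
    if replica['status'] == STATUS_REPLICATION_CHANGE:
        return True
    # Safeguard 2: both sides errored after a failed critical operation.
    # An automatic "repair" could recreate the snapmirror relationship
    # in the wrong direction; require manual intervention instead.
    if (active_replica['status'] == STATUS_ERROR
            and replica['status'] == STATUS_ERROR):
        return True
    return False
```

Keeping this decision in the share manager rather than the driver matches the commit's rationale: it is a policy about when automatic operations should run at all, not a backend-specific detail.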
c314414 to 0a1f256